{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import PreTrainedTokenizerFast, DataCollatorForLanguageModeling, PhiConfig, PhiForCausalLM, Trainer, TrainingArguments, TrainerCallback\n",
    "from datasets import load_dataset\n",
    "import pandas as pd\n",
    "import time\n",
    "import torch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. 数据来源，保存路径，最大长度定义"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer_dir = './model_save/tokenizer/'\n",
    "model_save_dir = './model_save/pre/'\n",
    "logs_dir = './logs/'\n",
    "train_file = './data/bell_pretrain_3M.parquet'\n",
    "max_seq_len = 400"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. 加载训练好的tokenizer\n",
    "如果你使用的`add_tokens`方法添加了自己的token，必须要用`len(tokenizer)`获取长度，`tokenizer.vocab_size`统计不包含你添加的字符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "vicab size: 35840\n"
     ]
    }
   ],
   "source": [
    "tokenizer = PreTrainedTokenizerFast.from_pretrained(tokenizer_dir)\n",
    "print(f\"vicab size: {len(tokenizer)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. 加载数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = load_dataset(path='parquet', data_files=train_file, split='train', cache_dir='.cache')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['text'],\n",
       "    num_rows: 3033566\n",
       "})"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['请输入一段描述有关海洋保护的情景对话。\\nPerson A: 哇，这个海滩真美啊！\\nPerson B: 是啊，可惜现在海洋污染越来越严重了。\\nPerson A: 对啊，以前来海边能看到很多海星、海螺等生物，现在好难看到了。\\nPerson B: 这些可爱的生物其实都是海洋生态系统的重要组成部分，如果它们消失了，整个生态系统都会受影响。\\nPerson A: 我们该怎么做才能保护海洋呢？\\nPerson B: 首先，我们要遵守海滩的相关规定，不要随便乱扔垃圾。其次，减少使用一次性塑料制品，比如塑料袋、吸管等，这些东西会进入海洋，危害海洋生物。最后，我们可以积极参与各种海洋保护组织，呼吁更多人加入到保护海洋的行动中来。\\nPerson A: 听起来保护海洋是一项重要的任务，我们一定要做好自己的份内工作。\\nPerson B: 是的，每个人都能为保护海洋贡献自己的一份力量。', '为我寻找5个值得信任的汽车保险公司\\n1. State Farm\\n2. Geico\\n3. Allstate\\n4. Progressive\\n5. Farmers Insurance', 'A公司去年亏损了500万美元，今年净利润增长了50%，今年的净利润是多少？\\n今年净利润为750万美元', '介绍最适合初学者的编程语言\\nPython是最适合初学者学习的编程语言之一。Python具有以下优点：\\n1. 简单易学：Python拥有简单易懂的语法和基本代码结构，可以使初学者更快地掌握编程基础。\\n2. 应用广泛：Python可以用于很多不同的任务和领域，例如数据科学、人工智能、机器学习、网络编程等。\\n3. 强大的社区支持：Python具有庞大的开发者社区，可以获得丰富的资源和支持。\\n4. 多资源学习：Python的学习资源丰富，包括图书、视频、课程等。\\n总之，对于初学者来说，Python是一门非常好的编程语言，它容易入门，应用广泛，有强大的社区支持和丰富的学习资源。', '以下是一道小学数学题： \\n有一条直线，上面有 6 只蚂蚁，其中 4 只向左走，2 只向右走。问：向右走的蚂蚁数量占比是多少？\\nA. 25%\\nB. 33.3%\\nC. 50%\\nD. 66.6%\\n回答该问题。\\n向右走的蚂蚁数量占比为两只蚂蚁中向右走的蚂蚁数量与总蚂蚁数量的比值。\\n先算出总蚂蚁数量：6 只蚂蚁。\\n再算出向右走的蚂蚁数量：2 只蚂蚁。\\n将向右走的蚂蚁数量除以总蚂蚁数量，即得到向右走的蚂蚁数量占比。\\n计算式为：\\n2 ÷ 6 = 0.33\\n所以，向右走的蚂蚁数量占比为 33.3%，选 B。']\n"
     ]
    }
   ],
   "source": [
    "samples = dataset['text'][0:5]\n",
    "print(samples)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## token to id缓存到文件，使用的时候不用再次tokenize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'input_ids': [[3, 1948, 1829, 1128, 8739, 20114, 225, 678, 10254, 221, 3985, 32137, 225, 177, 1], [3, 1933, 547, 10085, 24782, 32, 1596, 1791, 3259, 17514, 221, 5187, 424, 541, 4991, 7940, 9144, 221, 255, 3135, 2479, 2127, 225, 1]]}\n"
     ]
    }
   ],
   "source": [
    "def token_to_id(samples: dict[str, list]) -> dict:\n",
    "    batch_txt = []\n",
    "    for txt in samples['text']:\n",
    "        batch_txt.append(\n",
    "            f\"[BOS]{txt}[EOS]\"\n",
    "        )\n",
    "    outputs = tokenizer(\n",
    "        batch_txt,\n",
    "        truncation=False,\n",
    "        padding=False,\n",
    "        return_attention_mask=False,\n",
    "    )\n",
    "\n",
    "    return {\n",
    "        \"input_ids\": outputs[\"input_ids\"], \n",
    "        }\n",
    "\n",
    "print(token_to_id({'text':['判断给定的文章是否符合语法规则。如果不符合，请提供修改建议。\\n','下面是一篇文章的开头: \"为了探讨这个主题，本文将提供一系列数据和实例，以证明这一观点。']}))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['input_ids'],\n",
       "    num_rows: 3033566\n",
       "})"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "tokenized_datasets = dataset.map(\n",
    "    token_to_id, batched=True, batch_size=4096, remove_columns=dataset.column_names\n",
    ")\n",
    "tokenized_datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. 定义data_collator\n",
    "`mlm=False`表示要训练CLM模型，`mlm=True`表示要训练MLM模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "few_data = [tokenized_datasets[i] for i in range(5)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print(few_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  验证一下数据看padding、输入输出是否符合要求"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dict_keys(['input_ids', 'attention_mask', 'labels'])\n",
      "input_ids shape: torch.Size([5, 180])\n",
      "attention_mask shape: torch.Size([5, 180])\n",
      "labels shape: torch.Size([5, 180])\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "tensor(True)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "out = data_collator(few_data)\n",
    "print(out.keys())\n",
    "for key in out:\n",
    "    # print(out[key])\n",
    "    print(f\"{key} shape: {out[key].shape}\")\n",
    "\n",
    "# input_ids 和 labels 相同\n",
    "sum(out['input_ids'][0] == out['labels'][0]) == sum(out['attention_mask'][0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. 定义模型\n",
    "从`config`定义，不是`from_pretrained`。 \n",
    "为了方便cuda计算，词表的大小注意一下，如果不是64的整数倍，可以手动向上取整为64的整数倍，也可以是其他 $2^x$ 数值的整数倍，如32、128、256都行。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "final vocab sieze: 35840\n"
     ]
    }
   ],
   "source": [
    "vocab_size = len(tokenizer)\n",
    "if vocab_size % 64 != 0:\n",
    "    vocab_size = (vocab_size // 64 + 1) * 64\n",
    "print(f\"final vocab sieze: {vocab_size}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Phi-2 size: 262.9M parameters\n"
     ]
    }
   ],
   "source": [
    "phi_config = PhiConfig(\n",
    "    vocab_size=vocab_size,\n",
    "    bos_token_id=tokenizer.bos_token_id,\n",
    "    eos_token_id=tokenizer.eos_token_id,\n",
    "    hidden_size=768,\n",
    "    num_attention_heads=12,\n",
    "    num_hidden_layers=24,\n",
    "    max_position_embeddings=512,\n",
    "    intermediate_size=4096,\n",
    ")\n",
    "\n",
    "model = PhiForCausalLM(phi_config)\n",
    "# model = model.to_bettertransformer()\n",
    "\n",
    "model_size = sum(t.numel() for t in model.parameters())\n",
    "print(f\"Phi-2 size: {model_size / 1000**2:.1f}M parameters\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6. cuda cache回调函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "class EmptyCudaCacheCallback(TrainerCallback):\n",
    "    log_cnt = 0\n",
    "    def on_log(self, args, state, control, logs=None, **kwargs):\n",
    "        self.log_cnt += 1\n",
    "        if self.log_cnt % 5 == 0:\n",
    "            torch.cuda.empty_cache()\n",
    "            \n",
    "empty_cuda_cahce = EmptyCudaCacheCallback()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6. 定义训练参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "args = TrainingArguments(\n",
    "    output_dir=model_save_dir,\n",
    "    per_device_train_batch_size=8,\n",
    "    gradient_accumulation_steps=8,\n",
    "    num_train_epochs=4,\n",
    "    weight_decay=0.1,\n",
    "    warmup_steps=1000,\n",
    "    learning_rate=5e-4,\n",
    "    save_steps=2000,\n",
    "    save_total_limit=3,\n",
    "    report_to='tensorboard',\n",
    "    optim=\"adafactor\",\n",
    "    bf16=True,\n",
    "    logging_steps=10,\n",
    "    log_level='info',\n",
    "    logging_first_step=True,\n",
    ")\n",
    "\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    tokenizer=tokenizer,\n",
    "    args=args,\n",
    "    data_collator=data_collator,\n",
    "    train_dataset=tokenized_datasets,\n",
    "    callbacks=[empty_cuda_cahce],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7. 开始训练\n",
    "`resume_from_checkpoint=True`参数可以从上次保存的检查点继续训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.train(\n",
    "    # resume_from_checkpoint=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8. 最后保存训练的loss日志和模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Saving model checkpoint to ./model_save/pre/\n",
      "Configuration saved in ./model_save/pre/config.json\n",
      "Configuration saved in ./model_save/pre/generation_config.json\n",
      "Model weights saved in ./model_save/pre/pytorch_model.bin\n",
      "tokenizer config file saved in ./model_save/pre/tokenizer_config.json\n",
      "Special tokens file saved in ./model_save/pre/special_tokens_map.json\n"
     ]
    }
   ],
   "source": [
    "\n",
    "loss_log = pd.DataFrame(trainer.state.log_history)\n",
    "loss_log.to_csv(f\"./logs/pre_train_log_{time.strftime('%Y%m%d-%H%M')}.csv\")\n",
    "\n",
    "\n",
    "trainer.save_model(model_save_dir)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py310",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
